1. INTRODUCTION

Over the past few years, interest in natural language processing (NLP) [1] has increased significantly.
Today, several applications are investing massively in this technology, such as extending recommender
systems [2], [3], uncovering new insights in the health industry [4], [5], and unraveling e-reputation and
opinion mining [6], [7]. Opinion mining is an approach to computational linguistics and NLP that automatically
identifies the emotional tone, sentiment, or thoughts behind a body of text. As a result, it plays a vital role
in driving business decisions in many industries. However, seeking customer satisfaction is costly.
Indeed, mining user feedback on the products offered is the most accurate way to adapt strategies and
future business plans. In recent years, opinion mining has seen considerable progress, with applications in
social media and review websites. Recommendation may be staff-oriented or user-oriented and should
be tailored to meet customer needs and behaviors.

Nowadays, analyzing people’s emotions has become more accessible thanks to the availability of many
large pre-trained language models such as bidirectional encoder representations from transformers (BERT) [9]
and its variants. These models use the seminal transformer architecture [10], which is based solely on attention
mechanisms, to build robust language models for a variety of semantic tasks, including text classification.
Moreover, there has been a surge in opinion mining text datasets specifically designed to challenge NLP
models and enhance their performance. These datasets aim to enable models to match or even exceed
human-level performance while introducing more complex features.

Even though many papers have addressed NLP topics for opinion mining using high-performance
deep learning models, it is still challenging to determine their performance concretely and accurately due to
variations in technical environments and datasets. To address these issues, our paper studies
the behavior of cutting-edge transformer-based models on textual material and reveals their differences.
Specifically, it focuses on applying both transformer encoders and decoders, such as BERT [9] and generative
pre-trained transformer (GPT) [11], respectively, and their improvements on a benchmark dataset. This enables
a credible assessment of their performance and an understanding of their advantages, allowing subject matter
experts to clearly rank the models. Furthermore, through ablations, we show the impact of configuration choices on
the final results.


2. BACKGROUND 
2.1. Transformer 


The transformer [10], as illustrated in Figure 1, is an encoder-decoder model that dispenses entirely with
recurrence and convolutions. Instead, it leverages the attention mechanism to compute high-level contextualized
embeddings. Being the first model to rely solely on attention mechanisms, it addresses the issues
commonly associated with recurrent neural networks, which factor computation along the symbol positions of the
input and output sequences and therefore preclude parallelization within samples. By contrast, the transformer is
highly parallelizable and requires significantly less time to train. In the upcoming sections, we highlight the
recent breakthroughs in NLP built on the transformer, such as BERT [9] and its improvements, that changed the
field.
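To make the attention mechanism concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only: the toy embeddings are made-up numbers, and the learned query, key, and value projections as well as the multi-head split of the full architecture are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy self-attention: three positions, two-dimensional embeddings (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextualized = attention(x, x, x)
```

Every position is computed independently of the others inside the loop, which is the property that makes the computation parallelizable across a sequence.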
Figure 1. The transformer model architecture


Int J Artif Intell, Vol. 12, No. 4, December 2023: 1995-2010 


Int J Artif Intell ISSN: 2252-8938


2.2. BERT 


BERT [9] is pre-trained using a combination of masked language modeling (MLM) and next sentence 
prediction (NSP) objectives. It provides high-level contextualized embeddings grasping the meaning of words 
in different contexts through global attention. As a result, the pre-trained BERT model can be fine-tuned for a 
wide range of downstream tasks, such as question answering and text classification, without substantial task- 
specific architecture modifications. 
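As an illustration of the MLM objective, the sketch below corrupts a token sequence following the recipe of the original BERT paper (select about 15% of tokens; of those, replace 80% with a mask symbol, 10% with a random token, and leave 10% unchanged). These proportions are not stated in this section and are given here as background; the function name, the toy vocabulary, and the inflated masking probability in the example are assumptions for demonstration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with the mask symbol
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "the movie was surprisingly good".split()
# mask_prob is raised to 0.5 only so that the short toy sentence gets some masks.
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
```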

BERT and its variants allow the training of modern data-intensive models. Moreover, they are able to
capture the contextual meaning of each piece of text in a way that traditional language models cannot,
while being quicker to develop and yielding better results with less data. On the other hand, BERT and other
large neural language models are expensive and computationally intensive to train, fine-tune, and run
inference with.


2.3. GPT-I, II, III


GPT is the first causal or autoregressive transformer-based model pre-trained using language
modeling on a large corpus with long-range dependencies. Its larger, optimized successor,
GPT-2 [12], was pre-trained on WebText. Likewise, GPT-3 is architecturally similar to its predecessors. Its
higher level of accuracy is attributed to its increased capacity and greater number of parameters, and it was
pre-trained on Common Crawl. The OpenAI GPT family of models has taken pre-trained language modeling by storm:
these models are very powerful at realistic human text generation and many other NLP tasks. As a result,
a small amount of input text can be used to generate a large amount of high-quality text while maintaining
semantic and syntactic understanding of each word.
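The causal (autoregressive) property can be sketched as an attention mask in which each position only sees itself and the positions to its left. The helper names below are hypothetical and the example is a minimal illustration, not the GPT implementation.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def visible_context(tokens, i, mask):
    """Tokens that position i can see under the given mask."""
    return [tok for j, tok in enumerate(tokens) if mask[i][j]]

tokens = ["this", "film", "is", "great"]
mask = causal_mask(len(tokens))
print(visible_context(tokens, 2, mask))  # ['this', 'film', 'is']
```

A bidirectional encoder such as BERT would instead use a mask that is True everywhere, so that every position sees both its left and right context.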


2.4. ALBERT 


A lite BERT (ALBERT) was proposed to address the problems associated with large models. It
was specifically designed to provide contextualized natural language representations that improve results on
downstream tasks. Increasing the model size to pre-train embeddings, however, becomes harder due to memory
limitations and longer training times, which is what motivated this model.

ALBERT is a lighter version of BERT in which next sentence prediction (NSP) is replaced by sentence
order prediction (SOP). In addition, it employs two parameter-reduction techniques to reduce the memory
consumption and training time of BERT without hurting performance:


- Factorized embedding parameterization: the vocabulary embedding matrix is decomposed into two smaller
matrices, which separates the hidden layer size from the vocabulary embedding size and makes it easier to
grow the hidden size with fewer parameters.

- Cross-layer parameter sharing: layers are repeated and split among groups, which prevents the number of
parameters from growing with the depth of the network.
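The effect of the two techniques on the parameter count can be checked with simple arithmetic. The sketch below uses illustrative sizes (a vocabulary of 30,000 tokens, hidden size 768, embedding size 128); these values and the function names are assumptions chosen for the example rather than figures taken from this paper.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token-embedding block.

    Without factorization: one V x H matrix.
    With factorization:    a V x E matrix followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

def encoder_params(per_layer, num_layers, share=False):
    """With cross-layer sharing, a single set of layer weights is reused at every depth."""
    return per_layer if share else per_layer * num_layers

# Illustrative sizes (assumed).
V, H, E = 30_000, 768, 128
full = embedding_params(V, H)           # 23,040,000
factorized = embedding_params(V, H, E)  # 3,938,304
print(full, factorized)
```

With these sizes the factorized embedding block is almost six times smaller, and sharing makes the encoder parameter count independent of the number of layers.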


2.5. RoBERTa 


The choice of language model hyper-parameters has a substantial impact on the final results. Hence, the
robustly optimized BERT pre-training approach (RoBERTa) was introduced to investigate the impact of
many key hyper-parameters, along with data size, on model performance. RoBERTa is based on Google’s BERT
[9] model and modifies key hyper-parameters: the masked language modeling objective uses dynamic masking and
the NSP objective is removed. It is an improved version of BERT, pre-trained with much larger mini-batches
and learning rates on a large corpus using self-supervised learning.


2.6. XLNet 


The bidirectional property of transformer encoders, such as BERT [9], helps them achieve better
performance than approaches based on autoregressive language modeling. Nevertheless, BERT ignores the
dependency between masked positions and suffers from a pretrain-finetune discrepancy because it relies on
corrupting the input with masks. In view of these pros and cons, XLNet was proposed. XLNet is a generalized
autoregressive pretraining approach that learns bidirectional dependencies by maximizing the expected
likelihood over all permutations of the factorization order. Furthermore, it overcomes the drawbacks of
BERT [9] thanks to its causal or autoregressive formulation, inspired by Transformer-XL [17].
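The idea of permuting the factorization order can be illustrated on a three-token toy sequence. The sketch below only enumerates which positions each target is conditioned on under every order; it is not the XLNet training procedure (which samples orders and uses two-stream attention), and the function name is hypothetical.

```python
from itertools import permutations

def contexts_for_order(order):
    """Map each target position to the positions it conditions on under this order."""
    return {target: sorted(order[:step]) for step, target in enumerate(order)}

positions = (0, 1, 2)  # a three-token toy sequence
for order in permutations(positions):
    print(order, contexts_for_order(order))

# Every context set that position 1 is predicted from across all orders.
seen = {tuple(contexts_for_order(o)[1]) for o in permutations(positions)}
```

Across the six orders, position 1 is predicted from no context, from its left neighbor only, from its right neighbor only, and from both, which is how an autoregressive objective ends up exposing the model to bidirectional context.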
2.7. DistilBERT

Unfortunately, the outstanding performance of large-scale pretrained models is not
cheap. In fact, operating them on edge devices under constrained training or inference budgets
remains challenging. Against this backdrop, DistilBERT (or Distilled BERT) was introduced to
address these issues by leveraging knowledge distillation [19].

DistilBERT is similar to BERT, but it is smaller, faster, and cheaper. It has 40% fewer parameters than
BERT base and runs 60% faster, while preserving over 95% of BERT’s performance. It is trained by distilling
the pretrained BERT base model.
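The core of knowledge distillation is a soft-target loss: the student is trained to match the teacher's temperature-softened output distribution. The sketch below shows this generic loss only, under assumed logits and an assumed temperature of 2; it is not the complete DistilBERT training objective, which combines several loss terms.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher = [2.0, -1.0]        # made-up logits for (positive, negative)
good_student = [1.8, -0.9]   # agrees with the teacher
poor_student = [-1.0, 2.0]   # disagrees with the teacher
print(distillation_loss(good_student, teacher), distillation_loss(poor_student, teacher))
```

The loss is smallest when the student reproduces the teacher's distribution, so a student that agrees with the teacher is penalized less than one that disagrees.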


2.8. XLM-RoBERTa 

Pre-trained multilingual models at scale, such as multilingual BERT (mBERT) [9] and cross-lingual
language models (XLMs) [20], have led to considerable performance improvements on a wide variety of
cross-lingual transfer tasks, including question answering, sequence labeling, and classification. The
multilingual version of RoBERTa, called XLM-RoBERTa [21], was pre-trained on the newly created 2.5TB
multilingual CommonCrawl corpus containing 100 different languages and has pushed performance further. It
has shown strong improvements on low-resource languages compared to previous multilingual models.


2.9. BART 

Bidirectional and auto-regressive transformer (BART) is a generalization of BERT [9] and GPT
[11] that takes advantage of the standard transformer [10]. Concretely, it uses a bidirectional encoder and a
left-to-right decoder. It is trained by corrupting text with an arbitrary noising function and learning a model to
reconstruct the original text. BART has shown strong results when fine-tuned on text generation tasks
such as translation, and it also performs well on comprehension tasks like question answering and classification.


2.10. ConvBERT 

While BERT [9] and its variants have recently achieved large performance gains in many NLP
tasks compared to previous models, BERT suffers from a large computation cost and memory footprint due to its
reliance on the global self-attention block. Despite its many attention heads, BERT was found to be
computationally redundant, since some heads only need to learn local dependencies. ConvBERT therefore
improves on BERT [9] by replacing self-attention blocks with new mixed ones that leverage
convolutions to better model global and local context.


2.11. Reformer 

Large transformer models consistently achieve state-of-the-art results in a large variety of linguistic
tasks, but training them on long sequences is costly. To address this issue, the Reformer
was introduced to improve the efficiency of transformers while retaining high performance and smooth
training. Reformer is more efficient than the standard transformer thanks to locality-sensitive hashing attention,
reversible residual layers in place of the standard residuals, axial position encoding, and other optimizations.
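Locality-sensitive hashing groups similar vectors into the same bucket so that attention only has to be computed within buckets. The following is a minimal sketch of an angular LSH scheme (random projection, concatenation with the negation, then argmax), assuming toy four-dimensional vectors and eight buckets; the function names and values are illustrative, and the chunking and multi-round hashing of the full model are omitted.

```python
import random

def make_projection(dim, n_buckets, seed=0):
    """Random Gaussian matrix of shape (dim, n_buckets // 2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)]
            for _ in range(dim)]

def lsh_bucket(vector, projection):
    """Angular LSH: project, concatenate with the negation, take the argmax."""
    n_half = len(projection[0])
    projected = [sum(vector[i] * projection[i][j] for i in range(len(vector)))
                 for j in range(n_half)]
    scores = projected + [-p for p in projected]
    return max(range(len(scores)), key=scores.__getitem__)

proj = make_projection(dim=4, n_buckets=8)
a = [1.0, 0.5, -0.2, 0.3]
b = [2.0, 1.0, -0.4, 0.6]    # same direction as a, so it lands in the same bucket
c = [-1.0, -0.5, 0.2, -0.3]  # opposite direction, so it lands in a different bucket
print(lsh_bucket(a, proj), lsh_bucket(b, proj), lsh_bucket(c, proj))
```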


2.12. T5 

Transfer learning has emerged as one of the most influential techniques in NLP. Its efficiency in trans- 
ferring knowledge to downstream tasks through fine-tuning has given birth to a range of innovative approaches. 
One of these approaches is transfer learning with a unified text-to-text transformer (T5) [25], which consists 
of a bidirectional encoder and a left-to-right decoder. This approach is reshaping the transfer learning land- 
scape by leveraging the power of being pre-trained on a combination of unsupervised and supervised tasks and 
reframing every NLP task into text-to-text format. 


2.13. ELECTRA 

Masked language modeling (MLM) approaches like BERT [9] have proven to be effective when transferred
to downstream NLP tasks, although they are expensive and require large amounts of compute. Efficiently
learning an encoder that classifies token replacements accurately (ELECTRA) is a pre-training
approach that aims to overcome these computation problems by training two transformer models: the generator
and the discriminator. ELECTRA trains on a replaced token detection objective, using the discriminator
to identify which tokens in the sequence were replaced by the generator. Unlike in MLM-based models, the ELECTRA
objective is defined over all input tokens rather than just the small subset that was masked, making it a more
efficient pre-training approach.
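The training signal of replaced token detection can be sketched as a per-token binary label. In the example below the corrupted sequence is a hand-written stand-in for generator output, and the function name is hypothetical.

```python
def replaced_token_labels(original, corrupted):
    """1 where the generator's token differs from the original, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must be aligned token by token")
    return [int(o != c) for o, c in zip(original, corrupted)]

original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # hypothetical generator output
labels = replaced_token_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Every position receives a label, whereas an MLM objective would only provide a target at the masked positions.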
2.14. Longformer

While previous transformers focused on changing the pre-training methods, the long-document
transformer (Longformer) changes the transformer’s self-attention mechanism itself. It has
become the de facto standard for tackling a wide range of complex NLP tasks, with a new attention mechanism
that scales linearly with sequence length and can therefore easily process longer sequences. Longformer’s
new attention mechanism is a drop-in replacement for the standard self-attention and combines a local
windowed attention with a task-motivated global attention. Put simply, it replaces the transformer attention
matrices with sparse matrices for higher training efficiency.
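The linear scaling can be verified by counting the attended pairs of a combined local-window and global mask. This is a minimal sketch with an assumed window of 8 and a single global token at position 0; the function names are hypothetical, and dilated windows are not modeled.

```python
def longformer_mask(seq_len, window, global_positions=()):
    """mask[i][j] is True when position i attends to position j."""
    half = window // 2
    g = set(global_positions)
    # Local sliding window around each position.
    mask = [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            # Global tokens attend everywhere and are attended to by every token.
            if i in g or j in g:
                mask[i][j] = True
    return mask

def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)

for n in (64, 128, 256):
    sparse = attended_pairs(longformer_mask(n, window=8, global_positions=[0]))
    print(n, sparse, n * n)  # sparse pairs versus full self-attention pairs
```

With this configuration the number of attended pairs is 11n - 30 for a sequence of length n, compared with n squared for full self-attention.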


2.15. DeBERTa 
DeBERTa stands for decoding-enhanced BERT with disentangled attention. It is a pre-training
approach that extends Google’s BERT [9] and builds on RoBERTa [15]. Despite being trained on only half
of the data used for RoBERTa, DeBERTa improves the efficiency of pre-trained models through
two novel techniques:
- Disentangled attention (DA): an attention mechanism that computes the attention weights among words
using disentangled matrices based on two vectors that encode the content and the relative position of
each word, respectively.

- Enhanced mask decoder (EMD): a pre-training technique used to replace the output softmax layer. It
incorporates absolute positions in the decoding layer to predict masked tokens during model pre-training.


3. APPROACH 

Transformer-based pre-trained language models have led to substantial performance gains, but careful
comparison between different approaches is challenging. Therefore, we extend our study to uncover insights
regarding their fine-tuning process and main characteristics. Our paper first aims to study the behavior of
these models, following two approaches: a data-centric view focusing on the state and quality of the data, and a
model-centric view giving more attention to model tweaks. In particular, we examine how data processing
affects their performance and how adjustments and improvements made to the models over time change
their performance. We then conclude with some takeaways regarding the optimal setup that aids in
cross-validating a transformer-based model, specifically hyper-parameter tuning and data quality.


3.1. Models summary 

In this section, we present the details of the base versions of the models introduced previously, as shown
in Table A1. We aim to provide a fair comparison based on the following criteria: L (number of transformer
layers), H (hidden state size or model dimension), A (number of attention heads), total number of parameters,
tokenization algorithm, data used for pre-training, training devices and computational cost, training objectives,
tasks on which the model performs well, and a short description of the model’s key points [29]. All this
information helps in understanding the performance and behavior of the different transformer-based models and
in making the appropriate choice depending on the task and resources.


3.2. Configuration 

It should be noted that we used almost the same architectural building blocks for all our implemented
models, as shown in Figure 2 and Figure 3 for encoder-based and decoder-based models, respectively.
In contrast, seq2seq models like BART are simply a bidirectional encoder followed by an autoregressive
decoder. Each model is fed with the three required inputs, namely input ids, token type ids, and attention mask.
However, for some models the position embeddings are optional and can sometimes be ignored completely
(e.g., RoBERTa), which is why we have blurred them slightly in the figures. Furthermore, it is important to note
that we lowercased the dataset and tokenized it with tokenizers based on the WordPiece [30],
SentencePiece [31], and byte-pair encoding algorithms.

In our experiments, we used a highly optimized setup using only the base version of each pre-trained
language model. For training and validation, we set batch sizes of 8 and 4, respectively, and fine-tuned the
models for 4 epochs over the data with a maximum sequence length of 384, chosen to match
the majority of review lengths and our computational capabilities. The AdamW optimizer is used to optimize
the models with a learning rate of 3e-5, and the epsilon (eps) used to improve numerical stability is set to 1e-6,
which is the default value. Furthermore, the weight decay is set to 0.001, with some parameters, including bias
and LayerNorm.bias, excluded from it.
We implemented all of our models using PyTorch and the transformers library from Hugging Face, and ran them
on an NVIDIA Tesla P100-PCIE GPU (Persistence-M, 51G GPU RAM).
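The setup described above can be summarized as a configuration dictionary together with a helper that separates parameters exempt from weight decay. This is a sketch under stated assumptions rather than the authors' code: the dictionary keys, the helper name, and the toy parameter names are hypothetical, and only the two exempt name fragments that appear in the text are used. Parameter groups of this shape are the kind of input that an optimizer such as torch.optim.AdamW accepts.

```python
# Hyper-parameters as reported in the text above (key names are assumptions).
CONFIG = {
    "train_batch_size": 8,
    "valid_batch_size": 4,
    "epochs": 4,
    "max_len": 384,
    "learning_rate": 3e-5,
    "eps": 1e-6,
    "weight_decay": 0.001,
}

# Name fragments exempt from weight decay; only those listed in the text are used here.
NO_DECAY = ("bias", "LayerNorm.bias")

def group_parameters(named_parameters, weight_decay, no_decay=NO_DECAY):
    """Split (name, param) pairs into a decayed group and an exempt group."""
    named_parameters = list(named_parameters)  # allow generators to be passed in
    decayed = [p for n, p in named_parameters if not any(nd in n for nd in no_decay)]
    exempt = [p for n, p in named_parameters if any(nd in n for nd in no_decay)]
    return [
        {"params": decayed, "weight_decay": weight_decay},
        {"params": exempt, "weight_decay": 0.0},
    ]

# Hypothetical parameter names standing in for a model's named parameters.
toy = [("encoder.layer.0.dense.weight", "W0"),
       ("encoder.layer.0.dense.bias", "b0"),
       ("encoder.layer.0.LayerNorm.bias", "ln_b0")]
groups = group_parameters(toy, CONFIG["weight_decay"])
```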
Figure 2. The architecture of the transformer encoder-based models


3.3. Evaluation 

Dataset: To fine-tune our models, we used the IMDb movie review dataset [33], a binary sentiment
classification dataset of 50K highly polar movie reviews, balanced between positive
and negative labels. We chose it for our study because it is widely used in research and is a popular
resource for NLP and ML work, particularly sentiment analysis and text
classification, owing to its accessibility, size, balance, and pre-processing. In other words, it is easily
accessible and widely available, and its 50K reviews are split equally between positive and negative
labels, as shown in Figure 4, which helps prevent bias in the trained model. Additionally, it has already been
pre-processed, with the text of each review cleaned and normalized.

Metrics: To assess the performance of the fine-tuned transformers on the IMDb movie reviews dataset,
we track the loss and accuracy learning curves for each model. These curves help
detect incorrect predictions and potential overfitting, which are crucial factors to consider in the evaluation
process. Moreover, widely used metrics, namely accuracy, recall, precision, and F1-score, are valuable
when dealing with classification problems. With TP, FP, and FN denoting true positives, false positives, and
false negatives, the latter three can be defined as:


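$$
\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{and} \quad F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1}
$$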
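The definitions in (1) translate directly into code. The sketch below uses made-up confusion counts and hypothetical function names; as a sanity check, it also recomputes the F1-score from the precision (94.3) and recall (93.9) reported for BERT in Table 1.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts, following (1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Made-up confusion counts.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, round(f1, 4))

# Precision and recall reported for BERT in Table 1.
print(round(f1_from(94.3, 93.9), 1))  # 94.1
```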
Figure 3. The architecture of the transformer decoder-based models


4. RESULTS

In this section, we present the main fine-tuning results of our implemented transformer-based language
models on the opinion mining task on the IMDb movie reviews dataset. Overall, all the fine-tuned
models perform well, except the three autoregressive models GPT, GPT-2,
and Reformer, as shown in Table 1. The best model, ELECTRA, reaches an F1-score of 95.6 points, followed
by RoBERTa, Longformer, and DeBERTa, with F1-scores of 95.3, 95.1, and 95.1 points, respectively.
On the other hand, the worst model, GPT-2, reaches an F1-score of 52.9 points, as shown in Figure 5 and
Figure 6. These results suggest that purely autoregressive models do not perform well on comprehension
tasks like sentiment classification, where sequences may require access to bidirectional context for better word
representations and, therefore, good classification accuracy. With autoencoding models, which take advantage
of both left and right contexts, we saw good performance gains. A notable exception is the autoregressive XLNet
model, our fourth best model in Table 1 with an F1-score of 94.9%: it incorporates modeling techniques from
autoencoding models into autoregressive models while avoiding and addressing the limitations of encoders. The
code and fine-tuned models are available at [34].


Table 1. Transformer-based language models validation performance on the opinion mining IMDb dataset 


Model         Recall   Precision   F1     Accuracy
BERT          93.9     94.3        94.1   94.0
GPT           92.4     51.8        66.4   53.2
GPT-2         51.1     54.8        52.9   54.5
ALBERT        94.1     91.9        93.0   93.0
RoBERTa       96.0     94.6        95.3   95.3
XLNet         94.7     95.1        94.9   94.8
DistilBERT    94.3     92.7        93.5   93.4
XLM-RoBERTa   83.1     71.7        77.0   75.2
BART          96.0     93.3        94.6   94.6
ConvBERT      95.5     93.7        94.6   94.5
DeBERTa       95.2     95.0        95.1   95.1
ELECTRA       95.8     95.4        95.6   95.6
Longformer    95.9     94.3        95.1   95.0
Reformer      54.6     52.1        53.3   52.2
T5            94.8     93.4        94.0   93.9
Figure 5. Worst model: GPT-2 loss learning curve


5. ABLATION STUDY 

In Table 2 and Figure 7, we demonstrate the importance of configuration choices through controlled
trials and ablation experiments. Indeed, the maximum length of the sequence and data cleaning are particularly
crucial. Thus, to make our ablation study credible, we fine-tuned our BERT model with the same setup,
changing only the sequence length (max-len) in one experiment and cleaning the data (cd) in another, to observe
how each affects the performance of the model.
Figure 6. Worst model: GPT-2 acc learning curve


Table 2. Validation results of the BERT model based on different configurations, where cd stands for cleaned
data, meaning that the last model (BERT max-len=384, cd) is trained on an exhaustively cleaned text
Model                  Recall   Precision   F1      Accuracy
BERT max-len=64        86.8%    84.7%       85.8%   85.6%
BERT max-len=384       93.9%    94.3%       94.1%   94.0%
BERT max-len=384, cd   92.6%    91.6%       92.1%   92.2%
Figure 7. Validation accuracy history of BERT model based on different configurations


5.1. Effects of hyper-parameters 

The gap between the performance of BERT (max-len=64) and BERT (max-len=384) on the IMDb dataset is a
substantial 8.3 F1 points, as shown in Table 2, demonstrating how important this parameter is. Visualizing the
distribution of token or word counts is therefore a practical way to choose a value of the
maximum length parameter that suits the training data. Figure 8 illustrates the distribution
of the number of tokens in the IMDb movie reviews dataset; it shows that the majority of reviews are between
100 and 400 tokens in length. In this context, we chose 384 as the maximum length reference to study the effect
of the maximum length parameter, because it covers the majority of review lengths while conserving memory
and saving computational resources. It should be noted that the BERT model can process texts of up to 512 tokens
in length. This is a consequence of the model architecture and cannot be adjusted directly.
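One way to turn a length distribution into a choice of maximum length is to measure, for each candidate value, the fraction of sequences that fit without truncation. The sketch below uses made-up token counts in place of the real per-review lengths, and the function names are hypothetical.

```python
def coverage(lengths, max_len):
    """Fraction of sequences that fit within max_len tokens without truncation."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

def truncate(tokens, max_len):
    """Keep the first max_len tokens; everything after them is lost to the model."""
    return tokens[:max_len]

# Made-up token counts standing in for the per-review lengths of a corpus.
lengths = [45, 120, 180, 230, 260, 310, 350, 390, 520, 900]
for max_len in (64, 384, 512):
    print(max_len, coverage(lengths, max_len))
```

On these toy counts, a limit of 64 fits only one sequence in ten, whereas 384 fits seven in ten, which mirrors the trade-off between coverage and memory discussed above.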


5.2. Effects of data cleaning 
Traditional machine learning algorithms require extensive data cleaning before vectorizing the input 
sequence and then feeding it to the model, with the aim of improving both reliability and quality of the data. 


Analysis of the evolution of advanced transformer-based language models: ... (Nour Eddine Zekaoui) 


As a result, the model can focus only on important features during training. In contrast, the performance dropped
by 2 F1 points when we cleaned the data for the BERT model. The cleaning carried
out aims to normalize the words of each review. It includes lemmatization to group together the different forms
of the same word, stemming to reduce a word to its root by stripping suffixes and prefixes, deletion of
URLs, punctuation, and patterns that do not contribute to the sentiment, as well as the elimination of all stop
words except “no”, “nor”, and “not”, because their contribution to the sentiment can be tricky. For
instance, “Black Panther is boring” is a negative review, but “Black Panther is not boring” is a positive review.
This drop can be explained by the fact that BERT and other attention-based models need all the words of a sequence
to better capture the meaning of words in context. With cleaning, words may be represented differently
from their meaning in the original sequence. Note that “not boring” and “boring” are completely different in
meaning, but if the stop word “not” is removed, we end up with two similar sequences, which is harmful in a
sentiment analysis context.
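The example above can be reproduced with a small cleaning function. This is a simplified sketch, assuming a tiny illustrative stop-word list and regular-expression cleaning only; lemmatization and stemming are left out, and it is not the cleaning pipeline used in our experiments.

```python
import re

STOP_WORDS = {"is", "the", "a", "an", "no", "nor", "not"}  # tiny illustrative list
NEGATIONS = {"no", "nor", "not"}

def clean(text, keep_negations=True):
    """Lowercase, drop URLs and punctuation, then remove stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    removable = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [tok for tok in text.split() if tok not in removable]

print(clean("Black Panther is not boring."))
# ['black', 'panther', 'not', 'boring']
print(clean("Black Panther is not boring.", keep_negations=False))
# ['black', 'panther', 'boring']
print(clean("Black Panther is boring."))
# ['black', 'panther', 'boring']
```

When negations are removed, the positive and the negative review collapse to the same token sequence, so no model can tell them apart.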
Figure 8. Distribution of the number of tokens for a better selection of the maximum sequence length


5.3. Effects of bias and training data 

Carefully observing the accuracy and loss learning curves in Figure 9 and Figure 10, we notice that
the validation loss starts to creep upward and the validation accuracy starts to go down. From this point on, the
model in question progressively loses its ability to generalize to unseen data. In fact, the model is relatively
biased due to the effect of the training data and data-drift issues related to the fine-tuning data. In this context,
we assume that the model starts to overfit. However, setting different dropout rates, reducing the learning rate, or
even trying larger batches did not help; these strategies sometimes gave worse results
and more severe overfitting. For this reason, pre-training these models on domain-specific data and
vocabulary and then fine-tuning them may be the best solution.
Figure 9. Best model: ELECTRA loss learning curve
Figure 10. Best model: ELECTRA acc. learning curve


6. CONCLUSION 

In this paper, we presented a detailed comparison to highlight the main characteristics of transformer-based
pre-trained language models and what differentiates them from each other. Then, we studied their performance
on the opinion mining task. From this, we deduce the power of fine-tuning and how it helps in leveraging
the pre-trained models’ knowledge to achieve high accuracy on downstream tasks, even with the bias they carry
from the pre-training data. Experimental results show how well these models perform. The
highest F1-score, 95.6 points on the IMDb dataset, was obtained with the ELECTRA model. We also found
that access to both left and right contexts is necessary for comprehension tasks like sentiment
classification: autoregressive models like GPT, GPT-2, and Reformer perform poorly and
fail to achieve high accuracy. Nevertheless, XLNet reached good results even though it is an autoregressive
model, because it incorporates ideas taken from encoders characterized by their bidirectional property. Most
performances were close to each other, including that of DistilBERT, which achieves strong performance in less
training time thanks to knowledge distillation. For example, for 4 epochs, BERT took 70 minutes to train, while
DistilBERT took 35 minutes, losing only 0.6 F1 points but saving half the time taken by BERT. Moreover, our
ablation study shows that the maximum sequence length is one of the parameters with a significant impact
on the final results and must be carefully analyzed and adjusted. Likewise, data quality is a must for good
performance, ideally data that does not need to be processed, since extensive data cleaning may not help
the model capture local and global contexts in sequences from which words have been removed or trimmed
during cleaning. In addition, we notice that the majority of the models we fine-tuned on the IMDb dataset start
to overfit after a certain number of epochs, which can lead to biased models. Good-quality data alone is therefore
not enough; pre-training a model on large amounts of data and vocabulary from the business problem at hand may
help prevent it from making wrong predictions and help it reach a high level of generalization.